Bucketing Coding and Information Theory for
the Statistical High Dimensional Nearest Neighbor Problem

Abstract 

Consider the problem of finding high dimensional approximate nearest neighbors, where the data
is generated by some known probabilistic model. We will investigate a large natural class of algorithms
which we call bucketing codes. We will define bucketing information, prove that it bounds the
performance of all bucketing codes, and that the bucketing information bound can be asymptotically
attained by randomly constructed bucketing codes.

For example suppose we have $n$ Bernoulli(1/2) very long (length $d \to \infty$) sequences of bits. Let
$n - 2m$ sequences be completely independent, while the remaining $2m$ sequences are composed of
$m$ independent pairs. The interdependence within each pair is that their bits agree with probability
$1/2 < p < 1$. It is well known how to find most pairs with high probability by performing order of
$n^{\log_2 2/p}$ comparisons. We will see that order of $n^{1/p+\epsilon}$ comparisons suffice, for any
$\epsilon > 0$. Moreover if one sequence out of each pair belongs to a known set of
$n^{(2p-1)^2-\epsilon}$ sequences, then pairing can be done using order $n$ comparisons!

I. Introduction 

Suppose we have two bags of points, $X_0$ and $X_1$, randomly distributed in a high-dimensional
space. The points are independent of each other, with one exception: there is one unknown point
$x_0$ in bag $X_0$ that is significantly closer to an unknown point $x_1$ in bag $X_1$ than would be
accounted for by chance. We want an efficient algorithm for quickly finding these two 'paired'
points. More generally, one could have $m$ special pairs (up to having all points paired). An
algorithm that finds a single pair with probability $S$ will find an expected number of $mS$ pairs,
so keeping $m$ as a parameter is unnecessary.

We worked on finding texts that are translations of each other, which is a two bags problem
(the bags are languages). In most cases there is only one bag $X_0 = X_1 = X$, $n_0 = n_1 = n$.
The two bags model is slightly more complicated, but leads to clearer thinking. It is a bit
reminiscent of fast matrix multiplication: even when one is interested only in square matrices,
it pays to consider rectangular matrices too.

Let us start with the well known simple uniform marginally Bernoulli(1/2) example. Suppose
$X_0, X_1 \subset \{0,1\}^d$ of sizes $n_0, n_1$ respectively are randomly chosen as independent
Bernoulli(1/2) variables, with one exception. Choose uniformly randomly one point $x_0 \in X_0$, xor it
with a random Bernoulli($1-p$) vector and overwrite one uniformly chosen random point $x_1 \in X_1$. A
symmetric description is to say that the $i$'th bits of $x_0, x_1$ have the joint probability matrix

$$P = \begin{pmatrix} p/2 & (1-p)/2 \\ (1-p)/2 & p/2 \end{pmatrix} \tag{1}$$

for some known $1/2 < p < 1$. In practice $p$ will have to be estimated.
Let

$$\ln N = \ln n_0 + \ln n_1 - I(P)\,d \tag{2}$$

where

$$I(P) = I(p) = p\ln(2p) + (1-p)\ln(2(1-p)) \tag{3}$$

is the mutual information between the special pair's single coordinate values. Information theory
tells us that we can not hope to pin the special pair down into less than $N$ possibilities, but can
come close to it in some asymptotic sense. Assume that $N$ is small. How can we find the closest
pair? The trivial way to do it is to compare all the $n_0 n_1$ pairs. A better way has been known
for a long time. The earliest references I am aware of are Karp, Waarts and Zweig [7], Broder
[3], Indyk and Motwani [6]. They do not limit themselves to this simplistic problem, but their
approach clearly handles it. Without restricting generality let $n_0 \le n_1$. Randomly choose

$$k \approx \log_2 n_0 \tag{4}$$

out of the $d$ coordinates, and compare the point pairs which agree on these coordinates (in other
words, fall into the same bucket). The expected number of comparisons is

$$n_0 n_1 2^{-k} \approx n_1 \tag{5}$$

while the probability that the special pair falls into the same bucket is $p^k$. In case of failure
try again, with other random $k$ coordinates. At first glance it might seem that the expected number
of tries until success is $p^{-k}$, but that is not true because the attempts are interdependent. An
extreme example is $d = k$, where the attempts are identical. In the unlimited data case
$d \to \infty$ the expected number of tries is indeed $p^{-k}$, so the expected number of comparisons is

$$W \approx p^{-k} n_1 \approx n_0^{\log_2 1/p}\, n_1 \tag{6}$$

Is this optimal? Alon [1] has suggested the possibility of improvement by using Hamming's 
perfect code. 
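The classical procedure of (4)-(6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`find_pair`, `planted_instance`), the instance sizes and the Hamming-distance acceptance threshold of $d/4$ are hypothetical choices made here, not part of the original analysis.

```python
import math
import random

def hamming(a, b):
    """Number of coordinates where two equal-length bit tuples differ."""
    return sum(u != v for u, v in zip(a, b))

def find_pair(X0, X1, k, threshold, rng, max_tries=1000):
    """Repeatedly hash both bags on k random coordinates and compare
    only the point pairs that land in the same bucket."""
    d = len(X0[0])
    comparisons = 0
    for _ in range(max_tries):
        coords = rng.sample(range(d), k)
        buckets = {}
        for i, x in enumerate(X0):
            buckets.setdefault(tuple(x[c] for c in coords), []).append(i)
        for j, y in enumerate(X1):
            for i in buckets.get(tuple(y[c] for c in coords), []):
                comparisons += 1
                if hamming(X0[i], y) < threshold:
                    return (i, j), comparisons
    return None, comparisons

def planted_instance(n, d, p, rng):
    """Two bags of n uniform random points with one planted pair whose
    bits agree with probability p (an illustrative generator)."""
    X0 = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(n)]
    X1 = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(n)]
    i0, i1 = rng.randrange(n), rng.randrange(n)
    X1[i1] = tuple(b if rng.random() < p else 1 - b for b in X0[i0])
    return X0, X1, (i0, i1)

rng = random.Random(0)
n, d, p = 64, 256, 0.9
X0, X1, planted = planted_instance(n, d, p, rng)
k = round(math.log2(n))          # equation (4) with n_0 = n
found, comparisons = find_pair(X0, X1, k, threshold=d // 4, rng=rng)
print(found == planted, comparisons)
```

With $d$ much larger than $k$ the tries are nearly independent, so the comparison count should stay far below the $n^2$ of the trivial method.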

We have found that in the $n_0 = n_1 = n$ case, $W \approx n^{\log_2 2/p}$ can be reduced to

$$W \approx n^{1/p+\epsilon} \tag{7}$$

for any $1/2 < p < 1$, $\epsilon > 0$. This particular algorithm is described in the next section. Amazingly
it is possible to characterize the asymptotically best exponent not only for this problem, but for
a much larger class. We allow non binary discrete data, a limited amount of data ($d < \infty$) and
a general probability distribution of each coordinate.

We will prove theorem 10.1, a lower bound on the work performed by any bucketing algorithm.
It employs a newly defined bucketing information function $I(P, \lambda_0, \lambda_1, \mu)$, which generalizes
Shannon's mutual information function $I(P) = I(P, 1, 1, \infty)$. Comparing (2) with theorem 10.1
shows that the mutual information's function generalizes as well. Bucketing algorithms approaching
the information bound are constructed by random coding. The analogy with Shannon's coding
and information theory is very strong, suggesting that maybe we are redoing it in disguise. If it is
a disguise, it is quite effective. Coding with distortion theory also seems related. There is related
work [9], which tackles a particular class of practical bucketing algorithms (lexicographic forest
algorithms). Their performance turns out to be bounded by a bucketing forest information
function, and that bound is asymptotically attained by a specific practical algorithm.



II. An Asymptotically Better Algorithm 

The following algorithm does not generalize well, but makes sense for the uniform marginally
Bernoulli(1/2) problem (1) with $1/2 < p < 1$. Let $0 < d_0 \le d$ be some natural numbers. We
construct a $d$ dimensional bucket in the following way. Choose a random point $b \in \{0,1\}^d$. The
bucket contains all points $x \in \{0,1\}^d$ such that $x_i = b_i$ for exactly $d_0 - 1$ or $d_0$
coordinates $i$. (It is even better to allow $d_0 - 1, \ldots, d$, but the analysis gets a little messy.)
The algorithm uses $T$ such buckets, independently chosen. The probability of a point $x$ falling into a
bucket is

$$p_{A*} = 2^{-d}\left[\binom{d}{d_0-1} + \binom{d}{d_0}\right] \tag{8}$$

Let the number of points be

$$n_0 = n_1 = n = \lfloor 1/p_{A*} \rfloor \tag{9}$$

This way the expected number of comparisons (point pairs in the same bucket) is

$$T\,(n p_{A*})^2 \le T \tag{10}$$

The probability that both special pair points fall at least once into the same bucket is

$$S = \sum_{m=0}^{d} \binom{d}{m} p^{d-m}(1-p)^m \left[1 - (1 - S_m)^T\right] \tag{11}$$

$$S_m = 2^{-d}\binom{m}{\lfloor m/2\rfloor}\left[\binom{d-m}{d_0 - \lceil m/2\rceil} + \binom{d-m}{d_0 - \lceil (m+1)/2\rceil}\right] \tag{12}$$

The explanation follows. In these formulas $m$ is the number of coordinates $i$ at which the
special pair values disagree: $x_{0,i} \ne x_{1,i}$. Consider the special pair fixed. There are $2^d$ possible
baskets, independently chosen. Consider one basket. For $j,k = 0,1$ denote by $m_{jk}$ the number
of coordinates $i$ such that $x_{0,i} \oplus b_i = j$ and $x_{0,i} \oplus x_{1,i} = k$, where $\oplus$ is the
xor operation. We know that $m_{01} + m_{11} = m$ and $m_{00} + m_{10} = d - m$. Both $x_0, x_1$ fall into the
basket iff $m_{00} + m_{01} \in \{d_0 - 1, d_0\}$ and $m_{00} + m_{11} \in \{d_0 - 1, d_0\}$. There are two possibilities

$$\begin{pmatrix} m_{00} & m_{01} \\ m_{10} & m_{11} \end{pmatrix} = \begin{pmatrix} d_0 - \lceil m/2\rceil & \lfloor m/2\rfloor \\ d - d_0 - \lfloor m/2\rfloor & \lceil m/2\rceil \end{pmatrix} \tag{13}$$

$$\begin{pmatrix} m_{00} & m_{01} \\ m_{10} & m_{11} \end{pmatrix} = \begin{pmatrix} d_0 - \lceil (m+1)/2\rceil & \lceil m/2\rceil \\ d - d_0 - \lfloor (m-1)/2\rfloor & \lfloor m/2\rfloor \end{pmatrix} \tag{14}$$

each providing

$$\binom{m_{00} + m_{10}}{m_{00}}\binom{m_{01} + m_{11}}{m_{01}} \tag{15}$$

buckets.

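Clearly $m$ obeys a Binomial($d$, $1-p$) distribution, so by Chebyshev's inequality

$$S \ge \min_{|m-(1-p)d| < \sqrt{p(1-p)d/\epsilon}} \left(1 - e^{-TS_m}\right) - \epsilon \tag{16}$$

for any $0 < \epsilon < 1$. Hence taking

$$T = \left\lceil -\ln\epsilon \Big/ \min_{|m-(1-p)d| < \sqrt{p(1-p)d/\epsilon}} S_m \right\rceil \tag{17}$$

guarantees a success probability $S \ge 1 - 2\epsilon$. What is the relationship between $n$ and $T$? Let

$$d_0 \sim (1+\rho)d/2, \quad d \to \infty \tag{18}$$

By Stirling's approximation, with $I(\cdot)$ the function defined in (3),

$$\lim_{d\to\infty} \frac{\ln n}{d} = I\!\left(\frac{1+\rho}{2}\right) \tag{19}$$

$$\lim_{d\to\infty} \frac{\ln T}{d} = p\,I\!\left(\frac{1+\rho/p}{2}\right) \tag{20}$$

Letting $\rho \to 0$ results in exponent

$$\lim \frac{\ln T}{\ln n} = \frac{1}{p} \tag{21}$$
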

We are not yet finished with this algorithm, because the number of comparisons is not the
only component of work. One also has to throw the points into the baskets. The straightforward
way of doing it is to check the point-basket pairs. This involves $2nT$ checks, which is worse
than the naive $n^2$ algorithm! In order to overcome this, we take the $k$'th tensor power of the
previous algorithm. That means throwing $n^k$ points in $\{0,1\}^{kd}$ into $T^k$ buckets, by dividing
the coordinates into $k$ blocks of size $d$. The success probability is $S^k$, the expected number
of comparisons is at most $T^k$, but throwing the points into the baskets takes only an expected
number of $2n^kT$ vector operations (of length $kd$). Hence the total expected number of vector
operations is at most

$$T^k + 2n^kT \tag{22}$$

At last taking

$$k = \lceil 1/(1-p) \rceil \tag{23}$$

lets us approach the promised exponent $1/p$.
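For small $d$ the counting argument behind (12)-(15) can be checked by brute force. The sketch below is an illustrative check, not part of the original analysis: the function names are hypothetical, binomial coefficients with out-of-range arguments are taken to be zero, and the special pair is fixed as the all-zero point and a point with $m$ leading ones.

```python
from itertools import product
from math import comb

def bucket_prob(d, d0):
    """Equation (8): chance that a uniform point agrees with the bucket
    centre b on exactly d0 - 1 or d0 coordinates."""
    return (comb(d, d0 - 1) + comb(d, d0)) / 2 ** d

def pair_prob(d, d0, m):
    """Equation (12): chance that one random bucket holds both points of
    a pair that disagrees on exactly m coordinates."""
    def c(n, r):  # binomial coefficient, zero outside 0 <= r <= n
        return comb(n, r) if 0 <= r <= n else 0
    lo = -(-m // 2)            # ceil(m / 2)
    hi = -(-(m + 1) // 2)      # ceil((m + 1) / 2)
    return c(m, m // 2) * (c(d - m, d0 - lo) + c(d - m, d0 - hi)) / 2 ** d

def pair_prob_brute(d, d0, m):
    """The same quantity, obtained by enumerating all 2**d bucket centres."""
    x0 = (0,) * d
    x1 = (1,) * m + (0,) * (d - m)
    ok = (d0 - 1, d0)
    hits = 0
    for b in product((0, 1), repeat=d):
        a0 = sum(u == v for u, v in zip(x0, b))
        a1 = sum(u == v for u, v in zip(x1, b))
        hits += a0 in ok and a1 in ok
    return hits / 2 ** d

print(pair_prob(8, 5, 3), pair_prob_brute(8, 5, 3))
```

Both functions return an integer count divided by $2^d$, so they can be compared exactly rather than up to rounding.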



III. The Probabilistic Model 

Definition 3.1: The pairwise independent identically distributed data model is the following.
Let the sets

$$X_0 \subset \{0, 1, \ldots, b_0 - 1\}^d, \quad X_1 \subset \{0, 1, \ldots, b_1 - 1\}^d \tag{24}$$

of cardinalities $\#X_0 = n_0$, $\#X_1 = n_1$ be randomly constructed using the probability matrix

$$P = \begin{pmatrix} p_{00} & p_{01} & \cdots & p_{0\,b_1-1} \\ p_{10} & p_{11} & \cdots & p_{1\,b_1-1} \\ \vdots & \vdots & & \vdots \\ p_{b_0-1\,0} & p_{b_0-1\,1} & \cdots & p_{b_0-1\,b_1-1} \end{pmatrix} \tag{25}$$

$$p_{jk} \ge 0, \quad \sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} p_{jk} = 1 \tag{26}$$

The $X_0$ points are identically distributed pairwise independent Bernoulli random vectors, with

$$p_{j*} = \sum_{k=0}^{b_1-1} p_{jk} \tag{27}$$

probability that coordinate $i$ has value $j$. The probability of a single point $x_0 \in X_0$ is

$$p_{x_0 *} = \prod_{i=1}^{d} p_{x_{0,i} *} \tag{28}$$

and the probability of a set $B_0 \subset X_0$ is of course

$$p_{B_0 *} = \sum_{x_0 \in B_0} p_{x_0 *} \tag{29}$$

Similarly $X_1$ is governed by $p_{*k} = \sum_{j=0}^{b_0-1} p_{jk}$. There is a special pair of $X_0, X_1$ points,
uniformly chosen out of the $n_0 n_1$ possibilities. For that pair the probability that their $i$'th
coordinates are $j, k$ is $p_{jk}$, and for $x_0 \in X_0$, $x_1 \in X_1$

$$p_{x_0 x_1} = \prod_{i=1}^{d} p_{x_{0,i}\, x_{1,i}} \tag{30}$$


Coding and information theory were initially developed for a similar model (with a probability
vector instead of a probability matrix). Extension to non-uniform matrices, a stationary
model with coordinate dependency, or continuous data is possible, as was done for coding and
information theory.

IV. Comparison with the Indyk-Motwani Analysis 

The Indyk-Motwani paper [6] introduces a metric based, worst case analysis. In general no
average work upper bound can replace a worst case work upper bound, and the reverse holds
for lower bounds. Still some comparison is unavoidable. Let us consider the uniform marginally
Bernoulli(1/2) problem with $d \to \infty$. We saw that the classical approach requires
$W \approx n^{\log_2 2/p}$, and have reduced it to $W \approx n^{\epsilon+1/p}$. What is the Indyk-Motwani
bound? The Hamming distance between two random points is approximately $d/2$ (the ratio to $d$ tends
to 1/2 as $d$ grows, according to the law of large numbers). The Hamming distance between two related
points is approximately $(1-p)d$. Hence the distance ratio is $c = 1/(2-2p)$ and the Indyk-Motwani work
is

$$W \approx n^{1+1/c} = n^{3-2p} \tag{31}$$
It can be argued that the drop in performance is offset by the lack of pairwise independence
assumptions. The $n^{2/(1+e^{-1/c})} = n^{2/(1+e^{2p-2})}$ lower bound of Motwani, Naor and Panigrahy [8]
is interesting, but increasing it to $n^{1/p}$ seems a challenge.
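The three work exponents quoted so far for the uniform marginally Bernoulli(1/2) problem are easy to tabulate. The short sketch below only evaluates the closed forms $\log_2(2/p)$, $1/p$ and $3-2p$ from the text; the function name and dictionary keys are illustrative.

```python
import math

def exponents(p):
    """Exponent of n in the expected work, for agreement probability p."""
    return {
        "classical": math.log2(2 / p),   # random coordinate bucketing, (6)
        "new": 1 / p,                    # algorithm of section II, (7)
        "indyk_motwani": 3 - 2 * p,      # worst case bound, (31)
    }

for p in (0.6, 0.75, 0.9):
    print(p, exponents(p))
```

For every $1/2 < p < 1$ the values are ordered as $1/p < \log_2(2/p) < 3-2p$, and all three coincide at the endpoints $p = 1/2$ and $p = 1$.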

Now let us consider a typical sparse bits matrix: for a small $\epsilon$ let

$$P = \begin{pmatrix} 1 - 3\epsilon & \epsilon \\ \epsilon & \epsilon \end{pmatrix} \tag{32}$$

The standard bucketing approach is to arrange the coordinates randomly and hash each point
by its first $k$ 1's, where $k \approx -\ln n/\ln 2\epsilon$. The probability that two unrelated points fall
into the same bucket is less than $(2\epsilon)^k \approx 1/n$, so the expected work per try is approximately $n$.
The probability that the two related points fall into the same basket is at least

$$\binom{m}{k}(1-3\epsilon)^{m-k}\epsilon^k = \binom{m}{k}(1-3\epsilon)^{m-k}(3\epsilon)^k \cdot 3^{-k} \tag{33}$$

for any $m \ge k$ (consider the first $m$ coordinates). Taking $m \approx k/3\epsilon$ shows that the success
probability per try is at least approximately $3^{-k} \approx n^{\ln 3/\ln 2\epsilon}$. Hence in order to succeed
we will make $n^{-\ln 3/\ln 2\epsilon}$ tries, and the total expected work is

$$W \approx n^{1-\ln 3/\ln 2\epsilon} \tag{34}$$

In contrast the Hamming distance between random points is approximately $2(1-2\epsilon)2\epsilon d$ and
the Hamming distance between two related points is approximately $2\epsilon d$, so the Indyk-Motwani
distance ratio is $c = 2(1-2\epsilon) \le 2$ and

$$W \approx n^{1+1/c} \approx n^{3/2} \tag{35}$$

This worst case bound does not preclude the possibility that the random projections approach
recommended for sparse data by Datar, Indyk, Immorlica and Mirrokni [4] performs better.
Their optimal choice $r \to \infty$ results in a binary hash function
$h(x) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i C_i\right)$ where $(x_1, x_2, \ldots, x_d) \in X$ is any point and
$C_1, C_2, \ldots, C_d$ are independent Cauchy random variables (density $\frac{1}{\pi(1+x^2)}$). Both $\pm 1$
values have probability 1/2, so one has to concatenate $k \approx \log_2 n$
binary hash functions in order to determine a bucket. Now consider two related points. They will
have approximately $\epsilon d$ 1's in common, and each will have approximately $\epsilon d$ 1's where the
other has zeroes. The sum of $\epsilon d$ independent Cauchy random variables has the same distribution
as $\epsilon d$ times a single Cauchy random variable, so the probability that the two related points get
the same hash bit is approximately

$$\mathrm{Prob}\{\mathrm{sign}(C_1 + C_2) = \mathrm{sign}(C_1 + C_3)\} = 2/3 \tag{36}$$

Hence the amount of work is large:

$$W \approx n(3/2)^k \approx n^{\log_2 3} \tag{37}$$
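The value 2/3 in (36) can be checked by simulation. The following is a minimal Monte Carlo sketch; the sample size, the seed and the inverse-CDF sampler are choices made here for illustration and are not taken from [4].

```python
import math
import random

def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def same_sign_rate(trials, rng):
    """Estimate Prob{sign(C1 + C2) = sign(C1 + C3)} for i.i.d. Cauchy."""
    agree = 0
    for _ in range(trials):
        c1, c2, c3 = cauchy(rng), cauchy(rng), cauchy(rng)
        agree += (c1 + c2 > 0) == (c1 + c3 > 0)
    return agree / trials

rate = same_sign_rate(200_000, random.Random(1))
print(rate)  # should be close to 2/3
```

The estimate should land near 2/3. With $k = \log_2 n$ concatenated bits this gives $n(3/2)^k = n^{\log_2 3}$, as in (37).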

We have demonstrated that the probabilistic model adds to the current understanding of the 
approximate nearest neighbor problem. This is no surprise, since it is the standard model of 
information theory. 

V. Bucketing Codes 

Assume that there is enough information to identify the special pair. How much work is
necessary? Comparing all $n_0 n_1$ point pairs suffices. All the effective known nearest neighbor
algorithms are bucketing algorithms, so we will limit ourselves to these. But what are bucketing
algorithms? One could compute $m_0, m_1$ in some complicated way from the data, and then throw
the $m_0$'th point of $X_0$ and the $m_1$'th point of $X_1$ into a single bucket. It is unlikely to work, but
can you prove it? In order to disallow such knavery we will insist on data independent buckets.
Most practical bucketing algorithms are data dependent. That is necessary because the data is
used to construct (usually implicitly) a data model. We suspect that when the data model is
known, there is little to be gained by making the buckets data dependent.

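Definition 5.1: Assume the i.i.d. data model. A bucketing code is a set of $T$ subset pairs

$$(B_{0,0}, B_{1,0}), \ldots, (B_{0,T-1}, B_{1,T-1}) \subset X_0 \times X_1$$

Its success probability is

$$S = p_{\bigcup_{t=0}^{T-1} B_{0,t} \times B_{1,t}} \tag{38}$$

and for any real numbers $n_0, n_1 > 0$ its work is

$$W = \sum_{t=0}^{T-1} \max\left(n_0 p_{B_{0,t}*},\; n_1 p_{*B_{1,t}},\; n_0 p_{B_{0,t}*}\, n_1 p_{*B_{1,t}}\right)$$
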
t=0 

The meaning of success is obvious, but work has to be explained. In the above definition we
consider $n_0, n_1$ to be the expected number of $X_0, X_1$ points, so they are not necessarily integers.
The simplest implementation of a bucketing code is to store it as two point indexed arrays of
lists. The first array of size $b_0^d$ keeps for each point $x_0 \in \{0, 1, \ldots, b_0 - 1\}^d$ the list of buckets
(from 0 to $T - 1$) which contain it. The second array of size $b_1^d$ does the same for the $B_{1,t}$'s. When
we are given $X_0$ and $X_1$ we look each element up, and accumulate pointers to it in a buckets
array of $T$ lists of pointers. Then we compare the pairs in each of the $T$ buckets. Let us count
the expected number of operations. The expected number of buckets containing any specific
$X_0$ point is $\sum_{t=0}^{T-1} p_{B_{0,t}*}$, so the $X_0$ lookup involves an order of
$n_0 + n_0 \sum_{t=0}^{T-1} p_{B_{0,t}*}$ operations. Similarly the $X_1$ lookup takes
$n_1 + n_1 \sum_{t=0}^{T-1} p_{*B_{1,t}}$. The probability that a specific random pair
falls into bucket $t$ is $p_{B_{0,t}*}\, p_{*B_{1,t}}$, so the expected number of comparisons is
$\sum_{t=0}^{T-1} n_0 p_{B_{0,t}*}\, n_1 p_{*B_{1,t}}$. It all adds up to

$$n_0 + n_1 + \sum_{t=0}^{T-1}\left[n_0 p_{B_{0,t}*} + n_1 p_{*B_{1,t}} + n_0 p_{B_{0,t}*}\, n_1 p_{*B_{1,t}}\right] \le n_0 + n_1 + 3W \tag{39}$$

The fly in the ointment is that for even moderate dimension $d$ the memory requirements of
the previous algorithm are out of the universe. Hence it can be used only for small $d$. Higher
dimensions can be handled by splitting them up into short blocks, or by more sophisticated
coding algorithms.
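For very small $d$ the success probability and work of definition 5.1 can be evaluated directly. The sketch below assumes a bucketing code given explicitly as a list of pairs of Python sets of coordinate tuples; the function names and the $2 \times 2$ example matrix are illustrative, not from the source.

```python
from itertools import product

def point_prob(x, marginal):
    """Probability of one point under i.i.d. coordinates, as in (28)."""
    prob = 1.0
    for v in x:
        prob *= marginal[v]
    return prob

def evaluate_code(P, code, n0, n1):
    """Return (S, W) for a list of bucket pairs (B0, B1), each a set of
    coordinate tuples, under the joint matrix P."""
    row = [sum(r) for r in P]                 # p_{j*}
    col = [sum(c) for c in zip(*P)]           # p_{*k}
    covered = set()
    work = 0.0
    for B0, B1 in code:
        a = n0 * sum(point_prob(x, row) for x in B0)
        b = n1 * sum(point_prob(y, col) for y in B1)
        work += max(a, b, a * b)              # one term of the work sum
        covered.update(product(B0, B1))       # union of B0 x B1
    success = 0.0
    for x, y in covered:                      # special pair lands in the union
        prob = 1.0
        for j, k in zip(x, y):
            prob *= P[j][k]
        success += prob
    return success, work

P = [[0.4, 0.1], [0.1, 0.4]]                  # matrix (1) with p = 0.8
points = set(product((0, 1), repeat=2))       # d = 2
exact_match = [({x}, {x}) for x in points]    # one bucket per point
print(evaluate_code(P, exact_match, n0=4, n1=4))
```

The exact-match code succeeds only when the special pair agrees on both coordinates, so its success probability should be $0.8^2 = 0.64$, while the single all-inclusive bucket has success probability 1 at the price of $n_0 n_1$ work.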

VI. Basic Results 

Definition 6.1: For any nonnegative matrix or vector $R$, and a probability matrix or vector
$P$ of the same dimensions $b_0 \times b_1$, let the extended Kullback-Leibler divergence be

$$K(R\|P) = \sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} r_{jk}\ln\frac{r_{jk}}{r_{**}\,p_{jk}} \tag{40}$$

where $r_{**} = \sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} r_{jk}$.

Non-negativity follows from the well known inequality:

Lemma 6.1: For any nonnegative $q_0, q_1, \ldots, q_{b-1} \ge 0$, $p_0, p_1, \ldots, p_{b-1} \ge 0$

$$\sum_{j=0}^{b-1} q_j\ln\frac{q_j}{p_j} \ge q_*\ln\frac{q_*}{p_*} \tag{41}$$

where $q_* = \sum_{j=0}^{b-1} q_j$, $p_* = \sum_{j=0}^{b-1} p_j$.

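Definition 6.2: Suppose $P$ is a probability matrix. We write that $\lambda_0, \lambda_1 \le 1 \le \lambda_0 + \lambda_1$ are $P$
sub-conjugate to each other, denoted by $I(P, \lambda_0, \lambda_1, 1) = 0$, iff for any probability matrix
$Q$ of the same dimensions as $P$

$$K(Q_{\cdot\cdot}\|P_{\cdot\cdot}) \ge \lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) \tag{42}$$

Explicitly

$$\sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} q_{jk}\ln\frac{q_{jk}}{p_{jk}} \ge \lambda_0\sum_{j=0}^{b_0-1} q_{j*}\ln\frac{q_{j*}}{p_{j*}} + \lambda_1\sum_{k=0}^{b_1-1} q_{*k}\ln\frac{q_{*k}}{p_{*k}} \tag{43}$$

where $q_{j*} = \sum_{k=0}^{b_1-1} q_{jk}$ etc. The set of $P$ sub-conjugate pairs is convex by definition.
We will prove in section VIII

Theorem 6.2: For any bucketing code with probability matrix $P$, set sizes $n_0, n_1$, success
probability $S$ and work $W$

$$W \ge S \sup_{\lambda_0,\lambda_1 \le 1 \le \lambda_0+\lambda_1,\ I(P,\lambda_0,\lambda_1,1)=0} n_0^{\lambda_0} n_1^{\lambda_1} \tag{44}$$

The following inverse result is a special case of theorem 10.2

Theorem 6.3: For any probability matrices $P, Q$, a scalar $\epsilon > 0$ and large $N$ there exists a
bucketing code for matrix $P$, set sizes $n_0 = \lfloor N^{K(Q_{\cdot *}\|P_{\cdot *})}\rfloor$,
$n_1 = \lfloor N^{K(Q_{*\cdot}\|P_{*\cdot})}\rfloor$, with success probability $S \ge 1 - \epsilon$ and work
$W \le N^{\epsilon + K(Q_{\cdot\cdot}\|P_{\cdot\cdot})}$.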



VII. An Example 

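Consider the classical matrix $P = \begin{pmatrix} p/2 & (1-p)/2 \\ (1-p)/2 & p/2 \end{pmatrix}$. Inserting
$Q = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ into theorem 6.3 generates the well known
$n_0 = n_1 \approx N^{\ln 2}$, $S \ge 1 - \epsilon$ and $W \le N^{\epsilon+\ln(2/p)}$.

The $Q \approx P$ neighborhood is important. Setting $q_{jk} = p_{jk} + \delta_{jk}$, $\delta_{jk} \to 0$, $\delta_{**} = 0$
results in $n_0 \approx N^{\sum_j \delta_{j*}^2/2p_{j*}}$, $n_1 \approx N^{\sum_k \delta_{*k}^2/2p_{*k}}$, $S \ge 1 - \epsilon$ and
$W \le N^{\epsilon+\sum_{jk}\delta_{jk}^2/2p_{jk}}$. Linear algebra shows that
it is best to take $\delta_{00} = -\delta_{11} = \delta$, $\delta_{10} = -\delta_{01} = \alpha\delta$. Replacing $N$ with
$N^{1/2\delta^2}$ and $\epsilon$ with $2\epsilon\delta^2$ results in $n_0 \approx N^{(1-\alpha)^2}$,
$n_1 \approx N^{(1+\alpha)^2}$, $S \ge 1 - \epsilon$, $W \le N^{\epsilon+1/p+\alpha^2/(1-p)}$. In particular for $\alpha = 0$,
$n_0 = n_1 = n$, $S \ge 1 - \epsilon$, $W \le n^{\epsilon+1/p}$.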
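The exponents produced by theorem 6.3 are plain extended Kullback-Leibler divergences, so the first example above can be reproduced numerically. This is a sketch with hypothetical helper names; it assumes $P$ has strictly positive entries wherever $Q$ does.

```python
import math

def ext_kl(R, P):
    """Extended Kullback-Leibler divergence of definition 6.1 for flat
    lists R (nonnegative) and P (probabilities) of equal length."""
    total = sum(R)
    return sum(r * math.log(r / (total * q)) for r, q in zip(R, P) if r > 0)

def marginals(M):
    """Row sums and column sums of a matrix given as a list of rows."""
    return [sum(r) for r in M], [sum(c) for c in zip(*M)]

def flat(M):
    return [v for row in M for v in row]

def exponents_6_3(P, Q):
    """Exponents of N in theorem 6.3 for n0, n1 and W (without epsilon)."""
    p_rows, p_cols = marginals(P)
    q_rows, q_cols = marginals(Q)
    return (ext_kl(q_rows, p_rows), ext_kl(q_cols, p_cols),
            ext_kl(flat(Q), flat(P)))

p = 0.8
P = [[p / 2, (1 - p) / 2], [(1 - p) / 2, p / 2]]
print(exponents_6_3(P, [[1, 0], [0, 0]]))
```

For this $Q$ the three exponents should come out as $\ln 2$, $\ln 2$ and $\ln(2/p)$, matching the classical $W \approx n^{\log_2 2/p}$.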

Is the exponent $1/p$ best possible? Theorem 6.2 reduces the optimality of $1/p$ to a single
inequality:

Conjecture 7.1: For any $1/2 < p < 1$, $q_{00}, q_{01}, q_{10}, q_{11} \ge 0$, $q_{00} + q_{01} + q_{10} + q_{11} = 1$

$$\begin{aligned}
&2p\left[q_{00}\ln\frac{2q_{00}}{p} + q_{01}\ln\frac{2q_{01}}{1-p} + q_{10}\ln\frac{2q_{10}}{1-p} + q_{11}\ln\frac{2q_{11}}{p}\right] \ge &&(45)\\
&\ge (q_{00}+q_{01})\ln 2(q_{00}+q_{01}) + (q_{10}+q_{11})\ln 2(q_{10}+q_{11}) + &&(46)\\
&+ (q_{00}+q_{10})\ln 2(q_{00}+q_{10}) + (q_{01}+q_{11})\ln 2(q_{01}+q_{11}) &&(47)
\end{aligned}$$

Computer experimentation and critical point analysis leave no doubt that this inequality is
valid. It is four dimensional, and keeping the marginal probabilities fixed shows that we can
further restrict

$$(1-p)^2 q_{00}q_{11} = p^2 q_{01}q_{10} \tag{48}$$

A brute force proof is possible. Hopefully someone will find a clever proof.
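The kind of computer experimentation mentioned above can be imitated with a few lines of random search. The sketch below is not the experiment actually carried out for this work: the sampling scheme, the sample count and the function names are assumptions made here, and the convention $0 \ln 0 = 0$ is used.

```python
import math
import random

def xlogx_ratio(q, p):
    """q * ln(q / p), with the convention 0 * ln 0 = 0."""
    return q * math.log(q / p) if q > 0 else 0.0

def conjecture_gap(p, q00, q01, q10, q11):
    """Left side minus right side of the inequality in conjecture 7.1."""
    lhs = 2 * p * (xlogx_ratio(q00, p / 2) + xlogx_ratio(q01, (1 - p) / 2)
                   + xlogx_ratio(q10, (1 - p) / 2) + xlogx_ratio(q11, p / 2))
    rhs = sum(xlogx_ratio(s, 0.5)
              for s in (q00 + q01, q10 + q11, q00 + q10, q01 + q11))
    return lhs - rhs

def smallest_gap(samples, rng):
    """Smallest gap seen over random p in [0.5, 0.999) and random Q."""
    worst = math.inf
    for _ in range(samples):
        p = rng.uniform(0.5, 0.999)
        w = [rng.random() for _ in range(4)]
        total = sum(w)
        worst = min(worst, conjecture_gap(p, *(v / total for v in w)))
    return worst

print(smallest_gap(5000, random.Random(0)))
```

A negative value well below rounding error would be a counterexample. The gap vanishes at $Q = P$, so values near zero are expected close to that point.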

Expressing $N, \alpha$ in terms of $n_0, n_1$ shows that we can do with

$$e^{\frac{\ln n_0 + \ln n_1 - 2(2p-1)\sqrt{\ln n_0 \ln n_1}}{4p(1-p)(1-\epsilon)}}$$

comparisons. In particular when $n_0 = n_1^{(2p-1)^2-\epsilon}$, that asymmetric approximate nearest neighbor
problem is solvable in linear time!
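The exponent above is simple enough to evaluate directly. The sketch below drops the $\epsilon$ terms (that is, it computes the limiting exponent) and uses a hypothetical function name; it checks the symmetric case and the linear-time case.

```python
import math

def log_work(ln_n0, ln_n1, p):
    """Limiting value of ln W as a function of ln n0, ln n1 (epsilon = 0)."""
    cross = 2 * (2 * p - 1) * math.sqrt(ln_n0 * ln_n1)
    return (ln_n0 + ln_n1 - cross) / (4 * p * (1 - p))

p = 0.8
ln_n1 = 10.0
ln_n0 = (2 * p - 1) ** 2 * ln_n1          # n0 = n1 ** ((2p - 1) ** 2)
print(log_work(ln_n0, ln_n1, p))          # close to ln n1: linear work
print(log_work(7.0, 7.0, p), 7.0 / p)     # symmetric case: exponent 1/p
```

In the symmetric case $n_0 = n_1 = n$ the formula reduces to $\ln W = \ln n / p$, consistent with (7).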

VIII. A Proof From The Book 
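In this section we will prove theorem 6.2.

Theorem 8.1: For any probability matrices $P_1, P_2$ and $\lambda_0, \lambda_1 \le 1 \le \lambda_0 + \lambda_1$

$$I(P_1, \lambda_0, \lambda_1, 1) = I(P_2, \lambda_0, \lambda_1, 1) = 0 \iff I(P_1 \times P_2, \lambda_0, \lambda_1, 1) = 0 \tag{49}$$

where $\times$ is tensor product.

Proof: Direction $\Leftarrow$ is obvious, so assume the left hand side. Denote $P = P_1 \times P_2$:

$$p_{j_1k_1j_2k_2} = p_{1,j_1k_1}\,p_{2,j_2k_2} \tag{50}$$

For any probability matrix $\{q_{j_1k_1j_2k_2}\}_{j_1k_1j_2k_2}$

$$\sum_{j_1k_1j_2k_2} q_{j_1k_1j_2k_2}\ln\frac{q_{j_1k_1j_2k_2}}{p_{1,j_1k_1}p_{2,j_2k_2}} = \sum_{j_1k_1} q_{j_1k_1**}\ln\frac{q_{j_1k_1**}}{p_{1,j_1k_1}} + \sum_{j_1k_1j_2k_2} q_{j_1k_1j_2k_2}\ln\frac{q_{j_1k_1j_2k_2}}{q_{j_1k_1**}\,p_{2,j_2k_2}} \tag{51}$$

Because $I(P_1, \lambda_0, \lambda_1, 1) = 0$

$$\sum_{j_1k_1} q_{j_1k_1**}\ln\frac{q_{j_1k_1**}}{p_{1,j_1k_1}} \ge \lambda_0\sum_{j_1} q_{j_1***}\ln\frac{q_{j_1***}}{p_{1,j_1*}} + \lambda_1\sum_{k_1} q_{*k_1**}\ln\frac{q_{*k_1**}}{p_{1,*k_1}} \tag{52}$$

Because $I(P_2, \lambda_0, \lambda_1, 1) = 0$, for every $j_1, k_1$

$$\begin{aligned}
\sum_{j_2k_2}\frac{q_{j_1k_1j_2k_2}}{q_{j_1k_1**}}\ln\frac{q_{j_1k_1j_2k_2}}{q_{j_1k_1**}\,p_{2,j_2k_2}} \ge\ &\lambda_0\sum_{j_2}\frac{q_{j_1k_1j_2*}}{q_{j_1k_1**}}\ln\frac{q_{j_1k_1j_2*}}{q_{j_1k_1**}\,p_{2,j_2*}} + &&(53)\\
&+ \lambda_1\sum_{k_2}\frac{q_{j_1k_1*k_2}}{q_{j_1k_1**}}\ln\frac{q_{j_1k_1*k_2}}{q_{j_1k_1**}\,p_{2,*k_2}} &&(54)
\end{aligned}$$

Multiplying by $q_{j_1k_1**}$ and summing over $j_1, k_1$ gives

$$\sum_{j_1k_1j_2k_2} q_{j_1k_1j_2k_2}\ln\frac{q_{j_1k_1j_2k_2}}{q_{j_1k_1**}\,p_{2,j_2k_2}} \ge \lambda_0\sum_{j_1k_1j_2} q_{j_1k_1j_2*}\ln\frac{q_{j_1k_1j_2*}}{q_{j_1k_1**}\,p_{2,j_2*}} + \lambda_1\sum_{j_1k_1k_2} q_{j_1k_1*k_2}\ln\frac{q_{j_1k_1*k_2}}{q_{j_1k_1**}\,p_{2,*k_2}} \tag{55}$$

so with help from lemma 6.1

$$\sum_{j_1k_1j_2k_2} q_{j_1k_1j_2k_2}\ln\frac{q_{j_1k_1j_2k_2}}{q_{j_1k_1**}\,p_{2,j_2k_2}} \ge \lambda_0\sum_{j_1j_2} q_{j_1*j_2*}\ln\frac{q_{j_1*j_2*}}{q_{j_1***}\,p_{2,j_2*}} + \lambda_1\sum_{k_1k_2} q_{*k_1*k_2}\ln\frac{q_{*k_1*k_2}}{q_{*k_1**}\,p_{2,*k_2}} \tag{56}$$

Together

$$\sum_{j_1k_1j_2k_2} q_{j_1k_1j_2k_2}\ln\frac{q_{j_1k_1j_2k_2}}{p_{1,j_1k_1}p_{2,j_2k_2}} \ge \lambda_0\sum_{j_1j_2} q_{j_1*j_2*}\ln\frac{q_{j_1*j_2*}}{p_{1,j_1*}p_{2,j_2*}} + \lambda_1\sum_{k_1k_2} q_{*k_1*k_2}\ln\frac{q_{*k_1*k_2}}{p_{1,*k_1}p_{2,*k_2}} \tag{57}$$

hence $I(P_1 \times P_2, \lambda_0, \lambda_1, 1) = 0$. ■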

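Theorem 8.2: For any $B_0 \subset \{0, 1, \ldots, b_0 - 1\}^d$, $B_1 \subset \{0, 1, \ldots, b_1 - 1\}^d$

$$p_{B_0B_1} \le \min_{\lambda_0,\lambda_1 \le 1 \le \lambda_0+\lambda_1,\ I(P,\lambda_0,\lambda_1,1)=0} p_{B_0*}^{\lambda_0}\, p_{*B_1}^{\lambda_1} \tag{58}$$

Proof: Without restricting generality let $d = 1$. Inserting

$$q_{jk} = \begin{cases} p_{jk}/p_{B_0B_1} & j \in B_0,\ k \in B_1 \\ 0 & \text{otherwise} \end{cases} \tag{59}$$

into (42) proves the assertion. ■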
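Proof of theorem 6.2: Recall that the work is $W = \sum_i W_i$ where

$$W_i = \max\left(n_0 p_{B_{0,i}*},\; n_1 p_{*B_{1,i}},\; n_0 p_{B_{0,i}*}\, n_1 p_{*B_{1,i}}\right) \tag{60}$$

Our parameters satisfy

$$(\lambda_0, \lambda_1) \in \mathrm{Conv}(\{(1,0), (0,1), (1,1)\}) \tag{61}$$

hence

$$\ln W_i \ge \lambda_0\ln(n_0 p_{B_{0,i}*}) + \lambda_1\ln(n_1 p_{*B_{1,i}}) \tag{62}$$

and by theorem 8.2

$$W_i \ge n_0^{\lambda_0} n_1^{\lambda_1}\, p_{B_{0,i}*}^{\lambda_0}\, p_{*B_{1,i}}^{\lambda_1} \ge n_0^{\lambda_0} n_1^{\lambda_1}\, p_{B_{0,i}B_{1,i}} \tag{63}$$

Now sum up. ■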



IX. Bucketing Information 
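All the results of this section will be proven in appendix I.

Definition 9.1: Suppose $P$ is a probability matrix. The bucketing information function
is for $\mu \ge 0$

$$I(P, \lambda_0, \lambda_1, \mu) = \max_{\substack{\{r_{i,jk} \ge 0\} \\ 0 \le i < b_0b_1,\ 0 \le j < b_0,\ 0 \le k < b_1 \\ r_{*,**} = 1}} \left[\lambda_0\sum_{i=0}^{b_0b_1-1} K(R_{i,\cdot *}\|P_{\cdot *}) + \lambda_1\sum_{i=0}^{b_0b_1-1} K(R_{i,*\cdot}\|P_{*\cdot}) + (1-\mu)K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}) - \sum_{i=0}^{b_0b_1-1} K(R_{i,\cdot\cdot}\|P_{\cdot\cdot})\right]$$

Explicitly $r_{i,j*} = \sum_{k=0}^{b_1-1} r_{i,jk}$,
$K(R_{i,\cdot *}\|P_{\cdot *}) = \sum_{j=0}^{b_0-1} r_{i,j*}\ln\frac{r_{i,j*}}{r_{i,**}\,p_{j*}}$ etc.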

Lemma 9.1: For any probability matrix $P$ and $0 \le \mu \le 1$ the sums in definition 9.1 can be
restricted to a single term, i.e.

$$I(P, \lambda_0, \lambda_1, \mu) = \max_Q\left[\lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) - \mu K(Q_{\cdot\cdot}\|P_{\cdot\cdot})\right] \tag{64}$$

where $Q$ ranges over all probability matrices. For any $\mu \ge 0$, not restricting the number of terms
$i$ in definition 9.1 does not change $I$. It can be rewritten as

$$I(P, \lambda_0, \lambda_1, \mu) = \max_Q\left[(1-\mu)K(Q_{\cdot\cdot}\|P_{\cdot\cdot}) + \max_{(Q,y) \in \mathrm{Conv}(G(P,\lambda_0,\lambda_1))} y\right] \tag{65}$$

where Conv is the convex hull and

$$G(P, \lambda_0, \lambda_1) = \left\{\left(Q,\ \lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) - K(Q_{\cdot\cdot}\|P_{\cdot\cdot})\right)\right\}_Q \tag{66}$$

From now on when dealing with the bucketing information function, we will denote $\sum_i$ without
worrying about the number of indices.

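Lemma 9.2: For any probability matrix $P$ and $\mu \ge 0$ the bucketing information function
$I(P, \lambda_0, \lambda_1, \mu)$ is nonnegative, convex, monotonically nondecreasing in $\lambda_0, \lambda_1$ and monotonically
non-increasing in $\mu$. Special values are

$$I(P, \lambda_0, \lambda_1, \mu) = \mu I(P, \lambda_0/\mu, \lambda_1/\mu, 1) \qquad 0 < \mu \le 1 \tag{67}$$

$$I(P, \lambda_0, \lambda_1, \mu) = 0 \iff \forall Q,\ \min(\mu, 1)K(Q_{\cdot\cdot}\|P_{\cdot\cdot}) \ge \lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) \tag{68}$$

$$I(P, \lambda_0, \lambda_1, \mu) = 0 \Longleftarrow 0 \le \lambda_0, \lambda_1,\ \lambda_0 + \lambda_1 \le \min(\mu, 1) \tag{69}$$

$$I(P, 1, 1, \mu) = \max_{0 \le j < b_0,\ 0 \le k < b_1} \ln\frac{p_{jk}^{\mu}}{p_{j*}\,p_{*k}} \qquad 0 \le \mu \le 1 \tag{70}$$

$$I(P, 1, 1, \mu) = (\mu - 1)\ln\sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} p_{jk}\left(\frac{p_{jk}}{p_{j*}\,p_{*k}}\right)^{\frac{1}{\mu-1}} \qquad \mu \ge 1 \tag{71}$$

$$I(P, 1, 1, \infty) = I(P) = \sum_{j=0}^{b_0-1}\sum_{k=0}^{b_1-1} p_{jk}\ln\frac{p_{jk}}{p_{j*}\,p_{*k}} \tag{72}$$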
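The special values (70)-(72) are closed forms and can be evaluated directly. The sketch below assumes a matrix with strictly positive entries; the function names are illustrative. It shows numerically how $I(P, 1, 1, \mu)$ decreases towards Shannon's mutual information as $\mu$ grows.

```python
import math

def ratios(P):
    """Pairs (p_jk, p_jk / (p_j* p_*k)) for the positive entries of P."""
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return [(P[j][k], P[j][k] / (rows[j] * cols[k]))
            for j in range(len(P)) for k in range(len(P[0])) if P[j][k] > 0]

def mutual_information(P):
    """Equation (72)."""
    return sum(p * math.log(a) for p, a in ratios(P))

def info_11(P, mu):
    """I(P, 1, 1, mu) from the special values (70) and (71)."""
    cells = ratios(P)
    if mu <= 1:
        # ln(p^mu / (p_j* p_*k)) = (mu - 1) ln p + ln(p / (p_j* p_*k))
        return max((mu - 1) * math.log(p) + math.log(a) for p, a in cells)
    t = 1 / (mu - 1)
    return math.log(sum(p * a ** t for p, a in cells)) / t

P = [[0.4, 0.1], [0.1, 0.4]]          # matrix (1) with p = 0.8
for mu in (1, 2, 5, 50, 1e4):
    print(mu, info_11(P, mu))
print("mutual information", mutual_information(P))
```

For very small $\mu - 1$ the power $a^{1/(\mu-1)}$ can overflow in floating point, so a careful implementation would work with logarithms throughout.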

Theorem 9.3: For any probability matrices $P_1, P_2$ and $\mu \ge 0$

$$I(P_1 \times P_2, \lambda_0, \lambda_1, \mu) = I(P_1, \lambda_0, \lambda_1, \mu) + I(P_2, \lambda_0, \lambda_1, \mu) \tag{73}$$



X. Bucketing Codes and Information 

All the results of this section will be proven in appendix II.

Theorem 10.1: For any bucketing code with probability matrix $P_1 \times P_2 \times \cdots \times P_{\tilde d}$, dimension
$d = 1$, set sizes $n_0, n_1$, success probability $S$ and work $W$

$$\ln W \ge \sup_{\lambda_0,\lambda_1 \le 1 \le \lambda_0+\lambda_1,\ \mu \ge 0}\left[\lambda_0\ln n_0 + \lambda_1\ln n_1 + \mu\ln S - \sum_{i=1}^{\tilde d} I(P_i, \lambda_0, \lambda_1, \mu)\right] \tag{74}$$


Definition 10.1: Assume the i.i.d. data model with probability matrix $P$. Suppose there exists
a $d$ dimensional bucketing code such that for the expected numbers $n_0, n_1$ of $X_0, X_1$ points it
has success probability $\tilde S$ and work $\tilde W$. Then for any real numbers $0 \le S \le \tilde S$, $W \ge \tilde W$ we
say that $(P, d, n_0, n_1, S, W)$ is attainable. Define the set of log-attainable parameters to
be

$$D(P) = \left\{\tfrac{1}{d}(\ln n_0, \ln n_1, -\ln S, \ln W) \;\middle|\; (P, d, n_0, n_1, S, W) \text{ is attainable}\right\} \tag{75}$$

Normalizing by $d$ is awkward in the infinite data case $d = \infty$. There it makes sense to consider
the log-attainable cone

$$D_0(P) = \mathrm{Cone}(D(P)) = \bigcup_{\alpha \ge 0} \alpha D(P) \tag{76}$$
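Theorem 10.1 is asymptotically tight in the following sense:

Theorem 10.2: For any probability matrix $P$ the closure of its log-attainable set is

$$\begin{aligned}
D^c(P) = \{(m_0, m_1, s, w) \mid\ & s \ge 0, &&(77)\\
& \forall \lambda_0, \lambda_1 \le 1 \le \lambda_0 + \lambda_1,\ \mu \ge 0:\ w \ge \lambda_0 m_0 + \lambda_1 m_1 - \mu s - I(P, \lambda_0, \lambda_1, \mu)\} &&(78)
\end{aligned}$$

Equivalently

$$\begin{aligned}
D^c(P) = D(0) + \mathrm{Conv}\Big(\Big\{\Big(&\sum_i K(R_{i,\cdot *}\|P_{\cdot *}),\ \sum_i K(R_{i,*\cdot}\|P_{*\cdot}),\ K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}), &&(79)\\
&\sum_i K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) - K(R_{*,\cdot\cdot}\|P_{\cdot\cdot})\Big)\Big\}_R\Big) &&(80)
\end{aligned}$$

where $D(0)$ is the common core

$$D(0) = \mathrm{ConvCone}(\{(1,0,0,1), (0,1,0,1), (0,0,1,0), (0,0,0,1), (-1,-1,0,-1)\}) \tag{81}$$

For the unlimited data case $d \to \infty$

$$\begin{aligned}
D_0^c(P) =\ & \{(m_0, m_1, s, w) \mid s \ge 0\}\ \cap &&(82)\\
& \cap \left[D_0(0) + \mathrm{ConvCone}\left(\{(K(Q_{\cdot *}\|P_{\cdot *}),\ K(Q_{*\cdot}\|P_{*\cdot}),\ K(Q_{\cdot\cdot}\|P_{\cdot\cdot}),\ 0)\}_Q\right)\right] &&(83)
\end{aligned}$$

where $D_0(0)$ is the extended common core

$$D_0(0) = D(0) + \mathrm{Cone}(\{(0, 0, -1, 1)\}) \tag{84}$$

and $Q$ runs over all $b_0 \times b_1$ probability matrices.

In light of theorem 10.2, theorem 9.3 can be recast as

Theorem 10.3: For any probability matrices $P_1, P_2$: $D^c(P_1 \times P_2) = D^c(P_1) + D^c(P_2)$.
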



We consider the approximate nearest neighbor problem in a probabilistic setting. Using several 
coordinates at once enables asymptotically better approximate nearest neighbor algorithms than 
using them one at a time. The performance is bounded by, and tends to, a newly defined bucketing 
information function. Thus bucketing coding and information theory play the same role for the 
approximate nearest neighbor problem that Shannon's coding and information theory play for 
communication. 
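Appendix I
Bucketing Information Proofs

Proof of Lemma 9.1: Lemma 6.1 implies that

$$K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}) \le \sum_{i=0}^{b_0b_1-1} K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) \tag{85}$$

so for $0 \le \mu \le 1$

$$(1-\mu)K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}) - \sum_{i=0}^{b_0b_1-1} K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) \le -\mu\sum_{i=0}^{b_0b_1-1} K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) \tag{86}$$

and only one $i$ is necessary. The connection between definition 9.1 and (65) is through

$$\begin{aligned}
I(P, \lambda_0, \lambda_1, \mu) = \max_{\{r_i, Q_i\}_i,\ r_* = 1}\Big[&\sum_i r_i\left(\lambda_0 K(Q_{i,\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{i,*\cdot}\|P_{*\cdot}) - K(Q_{i,\cdot\cdot}\|P_{\cdot\cdot})\right) + &&(87)\\
&+ (1-\mu)K\Big(\sum_i r_i Q_{i,\cdot\cdot}\Big\|P_{\cdot\cdot}\Big)\Big] &&(88)
\end{aligned}$$

The set $G$ is $b_0 b_1$ dimensional, so by Caratheodory's theorem any point on the boundary of its
convex hull is a convex combination of $b_0 b_1$ $G$ points. ■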

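Proof of lemma 9.2: Non-negativity follows by taking $Q = P$. Monotonicity,
convexity and (67) are by definition.

When $0 \le \mu \le 1$ (64) is valid and (68) is clear. When $\mu \ge 1$

$$I(P, \lambda_0, \lambda_1, \mu) \le \max\sum_{i=0}^{b_0b_1-1}\left[\lambda_0 K(R_{i,\cdot *}\|P_{\cdot *}) + \lambda_1 K(R_{i,*\cdot}\|P_{*\cdot}) - K(R_{i,\cdot\cdot}\|P_{\cdot\cdot})\right] \tag{89}$$

so direction $\Leftarrow$ of (68) is true. On the other hand assume that for some $Q$

$$K(Q_{\cdot\cdot}\|P_{\cdot\cdot}) < \lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) \tag{90}$$

Inserting $r_{0,jk} = \epsilon q_{jk}$, $r_{1,jk} = p_{jk} - \epsilon q_{jk}$ into definition 9.1 gives

$$\begin{aligned}
I(P, \lambda_0, \lambda_1, \mu) \ge\ &\epsilon\left[\lambda_0 K(Q_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(Q_{*\cdot}\|P_{*\cdot}) - K(Q_{\cdot\cdot}\|P_{\cdot\cdot})\right] + &&(91)\\
&+ (1-\epsilon)\left[\lambda_0 K(\tilde P_{\cdot *}\|P_{\cdot *}) + \lambda_1 K(\tilde P_{*\cdot}\|P_{*\cdot}) - K(\tilde P_{\cdot\cdot}\|P_{\cdot\cdot})\right] &&(92)
\end{aligned}$$

where $\tilde P = (P - \epsilon Q)/(1 - \epsilon) = P + \epsilon(P - Q)/(1 - \epsilon)$. The Kullback-Leibler divergence between
$\tilde P$ and $P$ is second order in $\epsilon$, and the same holds for their marginal vectors. Hence for a small
$\epsilon > 0$ we get $I(P, \lambda_0, \lambda_1, \mu) > 0$, and the proof of (68) is done.
Lemma 6.1 implies

$$K(Q_{\cdot *}\|P_{\cdot *}),\ K(Q_{*\cdot}\|P_{*\cdot}) \le K(Q_{\cdot\cdot}\|P_{\cdot\cdot}) \tag{93}$$

so (69) follows from (68).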

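Now to $\lambda_0 = \lambda_1 = 1$. We want to maximize

$$\sum_i\left[K(R_{i,\cdot *}\|P_{\cdot *}) + K(R_{i,*\cdot}\|P_{*\cdot}) - K(R_{i,\cdot\cdot}\|P_{\cdot\cdot})\right] = \sum_{jk} r_{*,jk}\ln\frac{p_{jk}}{p_{j*}\,p_{*k}} - \sum_{ijk} r_{i,jk}\ln\frac{r_{i,jk}\,r_{i,**}}{r_{i,j*}\,r_{i,*k}}$$

The rightmost sum is nonnegative, and for any $\{r_{*,jk}\}_{jk}$ it can be made 0 by choosing

$$r_{i,jk} = \begin{cases} r_{*,jk} & i = j + b_0 k \\ 0 & \text{otherwise} \end{cases} \tag{94}$$

Hence we want to maximize

$$\sum_{jk} r_{*,jk}\ln\frac{p_{jk}^{\mu}}{p_{j*}\,p_{*k}} + (1-\mu)\sum_{jk} r_{*,jk}\ln r_{*,jk} \tag{95}$$

When $0 \le \mu \le 1$ both sums can be simultaneously maximized by concentrating $r$ in one place.
When $\mu \ge 1$ the maximized function is concave in $\{r_{*,jk}\}_{jk}$, and Lagrange multipliers reveal
the optimal choice

$$r_{*,jk} \propto p_{jk}\left(\frac{p_{jk}}{p_{j*}\,p_{*k}}\right)^{\frac{1}{\mu-1}} \tag{96}$$

■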



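Proof of theorem 9.3: Obviously $I(P_1 \times P_2, \lambda_0, \lambda_1, \mu) \ge I(P_1, \lambda_0, \lambda_1, \mu) + I(P_2, \lambda_0, \lambda_1, \mu)$.
The other direction is the challenge. Denote $P = P_1 \times P_2$:

$$p_{j_1k_1j_2k_2} = p_{1,j_1k_1}\,p_{2,j_2k_2} \tag{97}$$

For any $\{r_{i,j_1k_1j_2k_2}\}_{i,j_1k_1j_2k_2}$

$$\begin{aligned}
&(\mu-1)K(R_{*,\cdot\cdot\cdot\cdot}\|P_{\cdot\cdot\cdot\cdot}) + \sum_i K(R_{i,\cdot\cdot\cdot\cdot}\|P_{\cdot\cdot\cdot\cdot}) = \\
&= (\mu-1)\sum_{j_1k_1} r_{*,j_1k_1**}\ln\frac{r_{*,j_1k_1**}}{p_{1,j_1k_1}} + \sum_{i,j_1k_1} r_{i,j_1k_1**}\ln\frac{r_{i,j_1k_1**}}{r_{i,****}\,p_{1,j_1k_1}} + \\
&+ (\mu-1)\sum_{j_1k_1j_2k_2} r_{*,j_1k_1j_2k_2}\ln\frac{r_{*,j_1k_1j_2k_2}}{r_{*,j_1k_1**}\,p_{2,j_2k_2}} + \sum_{i,j_1k_1j_2k_2} r_{i,j_1k_1j_2k_2}\ln\frac{r_{i,j_1k_1j_2k_2}}{r_{i,j_1k_1**}\,p_{2,j_2k_2}}
\end{aligned}$$

By definition

$$\begin{aligned}
&(\mu-1)\sum_{j_1k_1} r_{*,j_1k_1**}\ln\frac{r_{*,j_1k_1**}}{p_{1,j_1k_1}} + \sum_{i,j_1k_1} r_{i,j_1k_1**}\ln\frac{r_{i,j_1k_1**}}{r_{i,****}\,p_{1,j_1k_1}} \ge \\
&\ge \lambda_0\sum_{i,j_1} r_{i,j_1***}\ln\frac{r_{i,j_1***}}{r_{i,****}\,p_{1,j_1*}} + \lambda_1\sum_{i,k_1} r_{i,*k_1**}\ln\frac{r_{i,*k_1**}}{r_{i,****}\,p_{1,*k_1}} - I(P_1, \lambda_0, \lambda_1, \mu)
\end{aligned}$$

and for every $j_1, k_1$

$$\begin{aligned}
&(\mu-1)\sum_{j_2k_2} r_{*,j_1k_1j_2k_2}\ln\frac{r_{*,j_1k_1j_2k_2}}{r_{*,j_1k_1**}\,p_{2,j_2k_2}} + \sum_{i,j_2k_2} r_{i,j_1k_1j_2k_2}\ln\frac{r_{i,j_1k_1j_2k_2}}{r_{i,j_1k_1**}\,p_{2,j_2k_2}} \ge \\
&\ge \lambda_0\sum_{i,j_2} r_{i,j_1k_1j_2*}\ln\frac{r_{i,j_1k_1j_2*}}{r_{i,j_1k_1**}\,p_{2,j_2*}} + \lambda_1\sum_{i,k_2} r_{i,j_1k_1*k_2}\ln\frac{r_{i,j_1k_1*k_2}}{r_{i,j_1k_1**}\,p_{2,*k_2}} - r_{*,j_1k_1**}\,I(P_2, \lambda_0, \lambda_1, \mu)
\end{aligned}$$

so with help from lemma 6.1

$$\begin{aligned}
&(\mu-1)\sum_{j_1k_1j_2k_2} r_{*,j_1k_1j_2k_2}\ln\frac{r_{*,j_1k_1j_2k_2}}{r_{*,j_1k_1**}\,p_{2,j_2k_2}} + \sum_{i,j_1k_1j_2k_2} r_{i,j_1k_1j_2k_2}\ln\frac{r_{i,j_1k_1j_2k_2}}{r_{i,j_1k_1**}\,p_{2,j_2k_2}} \ge \\
&\ge \lambda_0\sum_{i,j_1j_2} r_{i,j_1*j_2*}\ln\frac{r_{i,j_1*j_2*}}{r_{i,j_1***}\,p_{2,j_2*}} + \lambda_1\sum_{i,k_1k_2} r_{i,*k_1*k_2}\ln\frac{r_{i,*k_1*k_2}}{r_{i,*k_1**}\,p_{2,*k_2}} - I(P_2, \lambda_0, \lambda_1, \mu)
\end{aligned}$$

Together

$$\begin{aligned}
&(\mu-1)K(R_{*,\cdot\cdot\cdot\cdot}\|P_{\cdot\cdot\cdot\cdot}) + \sum_i K(R_{i,\cdot\cdot\cdot\cdot}\|P_{\cdot\cdot\cdot\cdot}) \ge \\
&\ge \lambda_0\sum_i K(R_{i,\cdot *\cdot *}\|P_{\cdot *\cdot *}) + \lambda_1\sum_i K(R_{i,*\cdot *\cdot}\|P_{*\cdot *\cdot}) - I(P_1, \lambda_0, \lambda_1, \mu) - I(P_2, \lambda_0, \lambda_1, \mu)
\end{aligned}$$

Notice that we have used the fact that for $0 \le \mu \le 1$ there is only one $i$. ■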



Appendix II 
Bucketing Codes and Information Proofs 

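Proof of theorem 10.1: Without restricting generality let $\tilde d = 1$. Let
$(B_{0,0}, B_{1,0}), \ldots, (B_{0,T-1}, B_{1,T-1})$ be subset pairs. Denote

$$B_i = B_{0,i} \times B_{1,i} \setminus \bigcup_{t=0}^{i-1} B_{0,t} \times B_{1,t}$$

so the success probability is $S = \sum_i p_{B_i}$. Insert

$$r_{i,jk} = \begin{cases} p_{jk}/S & (j,k) \in B_i \\ 0 & \text{otherwise} \end{cases}$$

into definition 9.1. Lemma 6.1 implies

$$K(R_{i,\cdot *}\|P_{\cdot *}) = \sum_{j \in B_{0,i}} r_{i,j*}\ln\frac{r_{i,j*}}{r_{i,**}\,p_{j*}} \ge -r_{i,**}\ln p_{B_{0,i}*}$$

Similarly

$$\lambda_0 K(R_{i,\cdot *}\|P_{\cdot *}) + \lambda_1 K(R_{i,*\cdot}\|P_{*\cdot}) \ge -r_{i,**}\left(\lambda_0\ln p_{B_{0,i}*} + \lambda_1\ln p_{*B_{1,i}}\right)$$

Recall that the work is $W = \sum_i W_i$ where

$$W_i = \max\left(n_0 p_{B_{0,i}*},\; n_1 p_{*B_{1,i}},\; n_0 p_{B_{0,i}*}\, n_1 p_{*B_{1,i}}\right)$$

Our parameters satisfy

$$(\lambda_0, \lambda_1) \in \mathrm{Conv}(\{(1,0), (0,1), (1,1)\})$$

hence

$$\ln W_i \ge \lambda_0\ln(n_0 p_{B_{0,i}*}) + \lambda_1\ln(n_1 p_{*B_{1,i}})$$

$$-\lambda_0\ln p_{B_{0,i}*} - \lambda_1\ln p_{*B_{1,i}} \ge \lambda_0\ln n_0 + \lambda_1\ln n_1 - \ln W_i$$

Clearly

$$K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}) = -\ln S$$

$$\sum_i K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) = -\sum_{ijk} r_{i,jk}\ln(r_{i,**}S) = -\ln S - \sum_i r_{i,**}\ln r_{i,**} \tag{108}$$

Now all the pieces come together:

$$\begin{aligned}
I(P, \lambda_0, \lambda_1, \mu) &\ge \lambda_0\ln n_0 + \lambda_1\ln n_1 - \sum_i r_{i,**}\ln W_i + \mu\ln S + \sum_i r_{i,**}\ln r_{i,**} = \\
&= \lambda_0\ln n_0 + \lambda_1\ln n_1 + \mu\ln S + \sum_i r_{i,**}\ln\frac{r_{i,**}}{W_i}
\end{aligned}$$

Another call of duty for lemma 6.1 produces

$$\sum_i r_{i,**}\ln\frac{r_{i,**}}{W_i} \ge -\ln W \tag{109}$$

■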

Lemma 2.1: Suppose that

$$(P_1, d, n_{0,1}, n_{1,1}, S_1, W_1), \quad (P_2, d, n_{0,2}, n_{1,2}, S_2, W_2) \tag{110}$$

are attainable. Then

$$(P_1 \times P_2, d, n_{0,1}n_{0,2}, n_{1,1}n_{1,2}, S_1S_2, W_1W_2) \tag{111}$$

is attainable, where $\times$ is tensor product. In particular when $P_1 = P_2 = P$ for any $k_1, k_2 \ge 0$ we
attain

$$(P, (k_1 + k_2)d, n_{0,1}^{k_1}n_{0,2}^{k_2}, n_{1,1}^{k_1}n_{1,2}^{k_2}, S_1^{k_1}S_2^{k_2}, W_1^{k_1}W_2^{k_2}) \tag{112}$$

In particular the closure of the log-attainable set $D^c(P)$ is convex.

Proof: Tensor product the codes. ■

Lemma 2.2: Suppose that

$$(P, d_1, n_0, n_1, S_1, W_1), \quad (P, d_2, n_0, n_1, S_2, W_2) \tag{113}$$

are attainable. Then

$$(P, d_1 + d_2, n_0, n_1, S_1 + S_2 - S_1S_2, W_1 + W_2) \tag{114}$$

is attainable. In particular for any $S_1 \le \tilde S_1 \le 1$

$$(\ln n_0, \ln n_1, -\ln\tilde S_1, \ln(W_1\tilde S_1/S_1)) \in D_0^c(P) \tag{115}$$

Proof: Concatenating the codes shows the first claim. Concatenating $T$ times the $k$'th tensor
power of the first code shows that

$$\left(P, Tkd_1, n_0^k, n_1^k, 1 - (1 - S_1^k)^T, TW_1^k\right) \tag{116}$$

is attainable. Taking $T = \lceil(\tilde S_1/S_1)^k\rceil$ and letting $k \to \infty$ finishes the proof. ■
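Proof of theorem 10.2: First let us show that the two representations are equivalent.
Denote the right hand side of (79) by $E$. It is the dual of its dual:

$$\begin{aligned}
E = \{(m_0, m_1, s, w) \mid\ &\alpha_0 m_0 + \alpha_1 m_1 - \beta s - \gamma w \le 1 \\
&\forall \alpha_0, \alpha_1, \beta, \gamma, R \text{ such that } \alpha_0, \alpha_1 \le \gamma \le \alpha_0 + \alpha_1,\ \beta, \gamma \ge 0, \\
&\alpha_0\sum_i K(R_{i,\cdot *}\|P_{\cdot *}) + \alpha_1\sum_i K(R_{i,*\cdot}\|P_{*\cdot}) + (\gamma - \beta)K(R_{*,\cdot\cdot}\|P_{\cdot\cdot}) - \gamma\sum_i K(R_{i,\cdot\cdot}\|P_{\cdot\cdot}) \le 1\}
\end{aligned}$$

When $\gamma = 0$ it forces $\alpha_0 = \alpha_1 = 0$ and we are left with $-\beta s \le 1$ for all $\beta \ge 0$, i.e. $s \ge 0$.
When $\gamma > 0$ we can divide by it, denote $\lambda_0 = \alpha_0/\gamma$, $\lambda_1 = \alpha_1/\gamma$, $\mu = \beta/\gamma$ and find that
$1/\gamma \ge I(P, \lambda_0, \lambda_1, \mu)$, so $E$ equals the right hand side of (77).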

Theorem 10.1 implies that $D^c(P) \subset E$. We will prove the inverse inclusion by construction.
The single big bags pair code

$$B_{0,0} = \{0, 1, \ldots, b_0 - 1\}, \quad B_{1,0} = \{0, 1, \ldots, b_1 - 1\} \tag{117}$$

shows that $D(0) \subset D(P)$. Now let $\{r_{i,jk}\}_{ijk}$ attain the bucketing information value $I$. For
dimension $d$ choose integers $\{d_{i,jk}\}_{ijk}$ such that $d_{*,**} = d$ and

$$r_{i,jk}d - 1 < d_{i,jk} < r_{i,jk}d + 1 \tag{118}$$

Let us define a bucket pair

$$B_{0,0} = \left\{x_0 \;\middle|\; \forall i,j\ \sum_{l=c_i+1}^{c_{i+1}} (x_{0,l} == j) = d_{i,j*}\right\} \tag{119}$$

$$B_{1,0} = \left\{x_1 \;\middle|\; \forall i,k\ \sum_{l=c_i+1}^{c_{i+1}} (x_{1,l} == k) = d_{i,*k}\right\} \tag{120}$$

where $c_i = \sum_{t=0}^{i-1} d_{t,**}$. In words we want $x_0$ to contain exactly $d_{0,j*}$ $j$-values in its first $d_{0,**}$
coordinates, etc. The bucket size is

$$p_{B_{0,0}*} = \prod_i\left[d_{i,**}!\prod_j\frac{p_{j*}^{d_{i,j*}}}{d_{i,j*}!}\right] \tag{121}$$

$$p_{*B_{1,0}} = \prod_i\left[d_{i,**}!\prod_k\frac{p_{*k}^{d_{i,*k}}}{d_{i,*k}!}\right] \tag{122}$$

Let us add $T - 1$ similar buckets. They are generated by randomly permuting the coordinates
$1, 2, \ldots, d$. Let $n_0 = 1/p_{B_{0,0}*}$, $n_1 = 1/p_{*B_{1,0}}$ so that the work is $W = T$. A lower bound of
the average success probability of this random bucketing code is

$$E[S] \ge U\left[1 - (1 - V/U)^T\right] \tag{123}$$

where

$$U = d!\prod_{jk}\frac{p_{jk}^{d_{*,jk}}}{d_{*,jk}!} \tag{124}$$

is the probability that the special pair obtains coordinate pair $(j, k)$ exactly $d_{*,jk}$ times, and

$$V = \prod_i\left[d_{i,**}!\prod_{jk}\frac{p_{jk}^{d_{i,jk}}}{d_{i,jk}!}\right] \tag{125}$$

is the probability that the special pair obtains coordinate pair $(j, k)$ exactly $d_{i,jk}$ times in
coordinate subset number $i$. Of course there exists a deterministic code at least as successful as
the average code.

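It is reasonable to take $T = \lceil U/V\rceil$. Stirling's approximation implies

$$\lim_{d\to\infty}\frac{1}{d}\ln n_0 = \sum_{ij} r_{i,j*}\ln\frac{r_{i,j*}}{r_{i,**}\,p_{j*}} \tag{126}$$

$$\lim_{d\to\infty}\frac{1}{d}\ln n_1 = \sum_{ik} r_{i,*k}\ln\frac{r_{i,*k}}{r_{i,**}\,p_{*k}} \tag{127}$$

$$\lim_{d\to\infty}-\frac{1}{d}\ln U = \sum_{jk} r_{*,jk}\ln\frac{r_{*,jk}}{p_{jk}} \tag{128}$$

$$\lim_{d\to\infty}-\frac{1}{d}\ln V = \sum_{ijk} r_{i,jk}\ln\frac{r_{i,jk}}{r_{i,**}\,p_{jk}} \tag{129}$$

Hence

$$\begin{aligned}
&\liminf_{d\to\infty}\frac{1}{d}\left(\lambda_0\ln n_0 + \lambda_1\ln n_1 + \mu\ln S - \ln W\right) \ge &&(130)\\
&\ge \lim_{d\to\infty}\frac{1}{d}\left(\lambda_0\ln n_0 + \lambda_1\ln n_1 + (\mu - 1)\ln U + \ln V\right) = I &&(131)
\end{aligned}$$

There remains the unlimited data formula (82). Lemma 2.2 shows that

$$D_0^c(P) = \tilde D_0(P) \cap \{(m_0, m_1, s, w) \mid s \ge 0\} \tag{132}$$

$$\tilde D_0(P) = D_0^c(P) + \mathrm{Cone}(\{(0, 0, -1, 1)\}) \tag{133}$$

Clearly $\tilde D_0(P)$ is convex, contains the origin, and any point $(\alpha_0, \alpha_1, \beta, \gamma)$ in its dual satisfies
$\beta \le \gamma$. Hence $\mu = \beta/\gamma \le 1$ so by lemma 9.1 only one $i$ term is needed, as long as we use the
full $D_0(0)$. ■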



